There are folium maps in this notebook. If they do not display, or do not display well, we suggest having a look at the .html file located in the same repository.
Are countries that plant more maize richer than countries planting rice? Is it true that developed countries produce more meat? Are you better off being a food net exporter or importer? Are food prices more stable if you produce more food locally or trade more?
In this project, we analyze the effects that a country's agricultural sector has on its economic indicators. The agricultural indicators we use are crops and livestock production, and exports and imports of crops, livestock and live animals. For these, we use the data from the "Global Food & Agriculture Statistics" datasets. We quantify economic success by Gross Domestic Product (GDP), but also by price stability, defined as small changes in Consumer Price Indices (CPI). We further use the Food and Agriculture Organization (FAO) definition of food self-sufficiency to analyze its link to economic success and stability. After identifying the agricultural products most strongly linked with economic success, we create visualizations in the form of maps. Through these timeline maps, we show how the production/export/import of important products has developed globally. We also use maps to visualize the level of food self-sufficiency and price stability.
We would like to work on the following research questions:
External imports:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import folium
import seaborn as sns
import json
import re
import requests
from bs4 import BeautifulSoup
from ipywidgets import interact
from IPython.display import display
import scipy.cluster.hierarchy as spc
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge
from operator import itemgetter
from sklearn import preprocessing
from tqdm import tqdm
Setup:
data_folder_path = "Scripts/Data/current_FAO/raw_files/"
files = {"Crops production" : "Production_Crops_E_All_Data_(Normalized).csv",
"Food trade" : "Trade_Crops_Livestock_E_All_Data_(Normalized).csv",
"Consumer price indices" : "ConsumerPriceIndices_E_All_Data_(Normalized).csv",
"Macroeconomy" : "Macro-Statistics_Key_Indicators_E_All_Data_(Normalized).csv",
"Livestock production" : "Production_Livestock_E_All_Data_(Normalized).csv",
"Live animals trade" : "Trade_LiveAnimals_E_All_Data_(Normalized).csv"
}
interesting_datasets = files.keys()
In this part, we will load, explore and clean the dataset in order to remove typing errors, missing information, inaccuracies, and so on.
Our main dataset is a subset of the "Global Food & Agriculture Statistics" found in the proposed datasets list. In this dataset, we have seen that we can work with production as well as import and export quantities per year and per country. As far as food is concerned, we use crops, livestock and live animals. We have also found information about countries' GDP and CPI in this database.
This database contains several files. We had a look at all of them. For food-related data about countries, we decided to focus on the following files:
Production_Crops_E_All_Data_(Normalized).csv contains data about crops production.
Trade_Crops_Livestock_E_All_Data_(Normalized).csv contains data about food trade (crops and livestock).
Production_Livestock_E_All_Data_(Normalized).csv contains data about livestock production.
Trade_LiveAnimals_E_All_Data_(Normalized).csv contains data about live animals trade.
For economy-related data about countries, we decided to focus on the following files:
ConsumerPriceIndices_E_All_Data_(Normalized).csv contains data about consumer price indices (CPI).
Macro-Statistics_Key_Indicators_E_All_Data_(Normalized).csv contains data about gross domestic product (GDP) along with other macroeconomic indicators.
def load_datasets(datasets):
    df = {}
    for dataset in datasets:
        file_path = data_folder_path + files[dataset]
        df[dataset] = pd.read_csv(file_path, encoding="ISO-8859-1")
    return df
We load each interesting dataset into the dictionary df:
df = load_datasets(interesting_datasets)
In this part, we will take a first look at the datasets in order to get a first sense of the data.
def display_df(df, datasets):
    for dataset in datasets:
        display(dataset, df[dataset].sample(5))
In order to see what the datasets look like, we display a sample of 5 rows for each of them:
display_df(df, interesting_datasets)
At first glance, our datasets seem very clean.
Each of our datasets contains a "Year" column and an "Area" column. This is great news for us since we want to do both a geographical and a time-related analysis.
The "Area" column corresponds to a country, except that it may also contain groups of countries (e.g. "Eastern Europe").
In this part, we will clean the datasets. The final goal is to produce one uniformized dataset on which we could work (see 1.F.).
In a very simplistic way, such a cleaned and uniformized dataset may look like this:
Country | Year | GDP | CPI | Food production features | Food trade features
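A minimal sketch, with made-up figures, of what such a uniformized row layout could look like in pandas (column names and values are purely illustrative, not the real data):

```python
import pandas as pd

# Toy illustration of the target schema; every figure below is made up.
uni_sketch = pd.DataFrame({
    "Country": ["Switzerland", "Switzerland", "France"],
    "Year": [2000, 2001, 2000],
    "GDP": [272.0, 279.0, 1362.0],                # million US$ (fictitious)
    "CPI": [99.8, 100.8, 98.9],                   # index (fictitious)
    "Maize production": [200.0, 210.0, 16000.0],  # food production feature (fictitious)
    "Maize imports": [350.0, 340.0, 120.0],       # food trade feature (fictitious)
})

# A (Country, Year) pair should uniquely identify a row.
assert not uni_sketch.duplicated(subset=["Country", "Year"]).any()
```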
Extracting crops harvested area, production, seed and yield from the "Crops production" dataset
Extracting stocks production from the "Livestock production" dataset
Extracting import and export quantities from the "Live animals trade" and "Crops trade" datasets
Extracting average CPI of each year from the "Consumer price indices" dataset
In this section, we will create dataframes in df_useful which correspond to the previous dataframes without the data we do not need.
df_useful = {}
The "Macroeconomy" dataset contains many different measures: Gross Fixed Capital Formation, Gross National Income, Value Added (Total Manufacturing), ... We are only interested in Gross Domestic Product, so we extract it from the "Macroeconomy" dataset. To keep values uniform, we choose the US$ value. All values then have the same unit (million US$), so we can drop the "Unit" column as well.
def extract_GDP(df):
    def selection_GDP(df):
        return df['Item'] == 'Gross Domestic Product'
    def selection_US_dollars(df):
        return df['Element'] == "Value US$"
    def drop_columns(df):
        dropped_columns = ["Item Code", "Item", "Element Code", "Element", "Flag", "Year Code", "Unit"]
        return df.drop(columns=dropped_columns)
    return drop_columns(df[selection_GDP(df) & selection_US_dollars(df)])
df_useful["GDP"] = extract_GDP(df["Macroeconomy"])
We can have a look at a sample of the extracted dataset:
display(df_useful["GDP"].sample(5))
And we can plot GDP in million US$ for different countries for the period 1970-2015:
select_switzerland = df_useful["GDP"]['Area']=='Switzerland'
select_france = df_useful["GDP"]['Area']=='France'
select_austria = df_useful["GDP"]['Area']=='Austria'
select_canada = df_useful["GDP"]['Area']=='Canada'
ax = df_useful["GDP"][select_switzerland].plot(x ='Year', y='Value', kind = 'line')
ax = df_useful["GDP"][select_france].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = df_useful["GDP"][select_austria].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = df_useful["GDP"][select_canada].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["Switzerland", 'France', 'Austria', "Canada"])
_ = ax.set_title('GDP in million US$ for different countries for the period 1970-2015')
For dissolved or newly formed countries, we have some NaN values (before formation or after dissolution), as in this next example:
select_USSR = df_useful["GDP"]['Area']=='USSR'
select_russia = df_useful["GDP"]['Area']=='Russian Federation'
select_ukraine = df_useful["GDP"]['Area']=='Ukraine'
ax = df_useful["GDP"][select_USSR].plot(x ='Year', y='Value', kind = 'line')
ax = df_useful["GDP"][select_russia].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = df_useful["GDP"][select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["USSR", 'Russia', 'Ukraine'])
_ = ax.set_title('GDP in million US$ for different countries for the period 1970-2015')
We want to extract crops harvested area, production, seed and yield from the "Crops production" dataset. As not all crops are food crops, we query the World Crops database to keep only the food crops.
def get_food_crops():
    # Return a list of crops categorized as food crops at https://world-crops.com/food-crops/
    url = "https://world-crops.com/food-crops/"
    r = requests.get(url, headers={"User-Agent": "XY"})
    soup = BeautifulSoup(r.text, 'html.parser')
    elements_temp = soup.find_all('a', href=re.compile("^../"))
    elements = [el.text for el in elements_temp]
    # Only 40 elements are displayed on each page -> iterate over the remaining pages
    for i in range(40, 401, 40):
        url_i = url + "?ss=" + str(i)
        r = requests.get(url_i, headers={"User-Agent": "XY"})
        soup = BeautifulSoup(r.text, 'html.parser')
        new_elements = soup.find_all('a', href=re.compile("^../"))
        elements += [el.text for el in new_elements]
    return elements
def inclusive_search(string, elements):
    # Returns True if the string can be found in elements. The search removes
    # special characters from string in order to include more positive results.
    string = string.lower()
    delimiters = ",", "(", "&", ")", " and ", " "
    pattern = '|'.join(map(re.escape, delimiters))
    strings = list(filter(None, re.split(pattern, string)))
    found = False
    for s in strings:
        if s == "nes":
            continue
        for el in elements:
            found = (s in el.split())
            if found == False and s[-1] == "s":
                found = s[:-1] in el.split()
            if found == False and s[-2:] == "es":
                found = s[:-2] in el.split()
            if found == False and s[-3:] == "ies":
                found = s[:-3] + "y" in el.split()
            if found == True:
                return found
    return found
def get_food_crop_data(df):
    # Extracts the food crop data; returns 4 dataframes: Area, Production, Seed and Yield
    df = df.copy()
    food_crops = list(map(lambda x: x.lower(), get_food_crops()))
    crop_types_df = df[['Item', 'Value']].groupby('Item').sum()
    crop_types_df = crop_types_df[list(map(lambda x: inclusive_search(x, food_crops), crop_types_df.index))]
    food_crop_df = df[df.Item.apply(lambda x: x in crop_types_df.index)]
    return (food_crop_df[food_crop_df.Element == 'Area harvested'],
            food_crop_df[food_crop_df.Element == 'Production'],
            food_crop_df[food_crop_df.Element == 'Seed'],
            food_crop_df[food_crop_df.Element == 'Yield'])
food_crop_area_df , food_crop_production_df , food_crop_seed_df , food_crop_yield_df = get_food_crop_data(df["Crops production"])
df_useful['Crops Area harvested'] = food_crop_area_df.drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
df_useful['Crops Production'] = food_crop_production_df.drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
df_useful['Crops Seed'] = food_crop_seed_df.drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
df_useful['Crops Yield'] = food_crop_yield_df.drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
We check everything is fine by looking at samples for each of the new dataframes:
display('Crops Area harvested', df_useful['Crops Area harvested'].sample(5))
display('Crops Production', df_useful['Crops Production'].sample(5))
display('Crops Seed', df_useful['Crops Seed'].sample(5))
display('Crops Yield', df_useful['Crops Yield'].sample(5))
We also make some plots to have a first understanding of the dataset:
select_Maize = df_useful['Crops Area harvested']['Item']=='Maize'
maize_df = df_useful['Crops Area harvested'][select_Maize]
select_switzerland = maize_df['Area']=='Switzerland'
select_france = maize_df['Area']=='France'
select_austria = maize_df['Area']=='Austria'
select_canada = maize_df['Area']=='Canada'
ax = maize_df[select_switzerland].plot(x ='Year', y='Value', kind = 'line')
ax = maize_df[select_france].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_austria].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_canada].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["Switzerland", 'France', 'Austria', "Canada"])
_ = ax.set_title('Crops Area harvested in ha for different countries for the period 1970-2015')
select_USSR = maize_df['Area']=='USSR'
select_russia = maize_df['Area']=='Russian Federation'
select_ukraine = maize_df['Area']=='Ukraine'
ax = maize_df[select_USSR].plot(x ='Year', y='Value', kind = 'line')
ax = maize_df[select_russia].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["USSR", 'Russia', 'Ukraine'])
_ = ax.set_title('Crops Area harvested in ha for different countries for the period 1970-2015')
We want to extract stocks production from the "Livestock production" dataset. Again, we drop the columns that are useless for us and take a first look at the data with a sample and some plots.
selection_stocks = df['Livestock production']["Element"] == 'Stocks'
df_useful['Livestock production'] = df['Livestock production'][selection_stocks].drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
display(df_useful['Livestock production'].sample(5))
select_pigs = df_useful['Livestock production']['Item']=='Pigs'
pigs_df = df_useful['Livestock production'][select_pigs]
select_switzerland = pigs_df['Area']=='Switzerland'
select_france = pigs_df['Area']=='France'
select_austria = pigs_df['Area']=='Austria'
select_canada = pigs_df['Area']=='Canada'
ax = pigs_df[select_switzerland].plot(x ='Year', y='Value', kind = 'line')
ax = pigs_df[select_france].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_austria].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_canada].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["Switzerland", 'France', 'Austria', "Canada"])
_ = ax.set_title('Pigs production in heads for different countries for the period 1970-2015')
select_USSR = pigs_df['Area']=='USSR'
select_russia = pigs_df['Area']=='Russian Federation'
select_ukraine = pigs_df['Area']=='Ukraine'
ax = pigs_df[select_USSR].plot(x ='Year', y='Value', kind = 'line')
ax = pigs_df[select_russia].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["USSR", 'Russia', 'Ukraine'])
_ = ax.set_title('Pigs production in heads for different countries for the period 1970-2015')
Now, we extract import and export quantities from the "Live animals trade" and "Crops trade" datasets, again displaying some samples and plots.
selection_import_quantities = df['Live animals trade']["Element"] == 'Import Quantity'
selection_export_quantities = df['Live animals trade']["Element"] == 'Export Quantity'
df_useful['Live animals import quantities'] = df['Live animals trade'][selection_import_quantities].drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
df_useful['Live animals export quantities'] = df['Live animals trade'][selection_export_quantities].drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
display(df_useful['Live animals import quantities'].sample(5))
select_pigs = df_useful['Live animals import quantities']['Item']=='Pigs'
pigs_df = df_useful['Live animals import quantities'][select_pigs]
select_switzerland = pigs_df['Area']=='Switzerland'
select_france = pigs_df['Area']=='France'
select_austria = pigs_df['Area']=='Austria'
select_canada = pigs_df['Area']=='Canada'
ax = pigs_df[select_switzerland].plot(x ='Year', y='Value', kind = 'line')
ax = pigs_df[select_france].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_austria].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_canada].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["Switzerland", 'France', 'Austria', "Canada"])
_ = ax.set_title('Pigs importation in heads for different countries for the period 1970-2015')
select_USSR = pigs_df['Area']=='USSR'
select_russia = pigs_df['Area']=='Russian Federation'
select_ukraine = pigs_df['Area']=='Ukraine'
ax = pigs_df[select_USSR].plot(x ='Year', y='Value', kind = 'line')
ax = pigs_df[select_russia].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["USSR", 'Russia', 'Ukraine'])
_ = ax.set_title('Pigs importation in head for different countries for the period 1970-2015')
display(df_useful['Live animals export quantities'].sample(5))
select_pigs = df_useful['Live animals export quantities']['Item']=='Pigs'
pigs_df = df_useful['Live animals export quantities'][select_pigs]
select_switzerland = pigs_df['Area']=='Switzerland'
select_france = pigs_df['Area']=='France'
select_austria = pigs_df['Area']=='Austria'
select_canada = pigs_df['Area']=='Canada'
ax = pigs_df[select_switzerland].plot(x ='Year', y='Value', kind = 'line')
ax = pigs_df[select_france].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_austria].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_canada].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["Switzerland", 'France', 'Austria', "Canada"])
_ = ax.set_title('Pigs exportation in heads for different countries for the period 1970-2015')
select_USSR = pigs_df['Area']=='USSR'
select_russia = pigs_df['Area']=='Russian Federation'
select_ukraine = pigs_df['Area']=='Ukraine'
ax = pigs_df[select_USSR].plot(x ='Year', y='Value', kind = 'line')
ax = pigs_df[select_russia].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = pigs_df[select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["USSR", 'Russia', 'Ukraine'])
_ = ax.set_title('Pigs exportation in heads for different countries for the period 1970-2015')
selection_import_quantities = df['Food trade']["Element"] == 'Import Quantity'
selection_export_quantities = df['Food trade']["Element"] == 'Export Quantity'
df_useful['Food import quantities'] = df['Food trade'][selection_import_quantities].drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
df_useful['Food export quantities'] = df['Food trade'][selection_export_quantities].drop(columns=['Item Code', "Element Code", "Element", "Year Code", "Flag"])
display(df_useful['Food import quantities'].sample(5))
select_Maize = df_useful['Food import quantities']['Item']=='Maize'
maize_df = df_useful['Food import quantities'][select_Maize]
select_switzerland = maize_df['Area']=='Switzerland'
select_france = maize_df['Area']=='France'
select_austria = maize_df['Area']=='Austria'
select_canada = maize_df['Area']=='Canada'
ax = maize_df[select_switzerland].plot(x ='Year', y='Value', kind = 'line')
ax = maize_df[select_france].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_austria].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_canada].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["Switzerland", 'France', 'Austria', "Canada"])
_ = ax.set_title('Maize importation in tonnes for different countries for the period 1970-2015')
select_USSR = maize_df['Area']=='USSR'
select_russia = maize_df['Area']=='Russian Federation'
select_ukraine = maize_df['Area']=='Ukraine'
ax = maize_df[select_USSR].plot(x ='Year', y='Value', kind = 'line')
ax = maize_df[select_russia].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["USSR", 'Russia', 'Ukraine'])
_ = ax.set_title('Maize importation in tonnes for different countries for the period 1970-2015')
display(df_useful['Food export quantities'].sample(5))
select_Maize = df_useful['Food export quantities']['Item']=='Maize'
maize_df = df_useful['Food export quantities'][select_Maize]
select_switzerland = maize_df['Area']=='Switzerland'
select_france = maize_df['Area']=='France'
select_austria = maize_df['Area']=='Austria'
select_canada = maize_df['Area']=='Canada'
ax = maize_df[select_switzerland].plot(x ='Year', y='Value', kind = 'line')
ax = maize_df[select_france].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_austria].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_canada].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["Switzerland", 'France', 'Austria', "Canada"])
_ = ax.set_title('Maize exportation in tonnes for different countries for the period 1970-2015')
select_USSR = maize_df['Area']=='USSR'
select_russia = maize_df['Area']=='Russian Federation'
select_ukraine = maize_df['Area']=='Ukraine'
ax = maize_df[select_USSR].plot(x ='Year', y='Value', kind = 'line')
ax = maize_df[select_russia].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = maize_df[select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["USSR", 'Russia', 'Ukraine'])
_ = ax.set_title('Maize exportation in tonnes for different countries for the period 1970-2015')
The "Consumer price indices" dataset contains monthly data. In order to have a uniform dataframe, and as the other dataframes contain yearly data, we group it by (Country, Year) and compute the mean over the months.
df_useful['Consumer price indices'] = df['Consumer price indices'][['Area',"Year",'Value']] \
.dropna() \
.groupby(['Area',"Year"]) \
.mean() \
.reset_index() \
.dropna()
With samples and plots, we notice that this dataset only starts in 2000 whereas the other ones start in 1970.
display(df_useful['Consumer price indices'].sample(5))
select_switzerland = df_useful['Consumer price indices']['Area']=='Switzerland'
select_france = df_useful['Consumer price indices']['Area']=='France'
select_austria = df_useful['Consumer price indices']['Area']=='Austria'
select_canada = df_useful['Consumer price indices']['Area']=='Canada'
ax = df_useful['Consumer price indices'][select_switzerland].plot(x ='Year', y='Value', kind = 'line')
ax = df_useful['Consumer price indices'][select_france].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = df_useful['Consumer price indices'][select_austria].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = df_useful['Consumer price indices'][select_canada].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["Switzerland", 'France', 'Austria', "Canada"])
_ = ax.set_title('Consumer price indices in % for different countries for the period 2000-2015')
select_russia = df_useful["Consumer price indices"]['Area']=='Russian Federation'
select_ukraine = df_useful["Consumer price indices"]['Area']=='Ukraine'
ax = df_useful["Consumer price indices"][select_russia].plot(x ='Year', y='Value', kind = 'line')
ax = df_useful["Consumer price indices"][select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(['Russia', 'Ukraine'])
_ = ax.set_title('Consumer price indices in % for different countries for the period 2000-2015')
Having a more detailed look at the dataset, we noticed that the areas which are real countries are exactly the ones with an "Area Code" below $5000$ and not in $[261, 269]$.
# Remove Area Code >= 5000 or in [261, 269] (EU)
for df_name in df_useful:
    if 'Area Code' in df_useful[df_name].keys():
        print("Removing areas which are not countries in", df_name)
        selection_countries = df_useful[df_name]['Area Code'] < 261
        selection_countries = selection_countries | (df_useful[df_name]['Area Code'] > 269)
        selection_countries = selection_countries & (df_useful[df_name]['Area Code'] < 5000)
        df_useful[df_name] = df_useful[df_name][selection_countries]
        display(df_useful[df_name].sample(5))
    else:
        print(df_name, "is already clean")
In this section, we will explain how we will handle the missing data in previous dataframes for maps.
select_USSR = df_useful["GDP"]['Area']=='USSR'
select_russia = df_useful["GDP"]['Area']=='Russian Federation'
select_ukraine = df_useful["GDP"]['Area']=='Ukraine'
ax = df_useful["GDP"][select_USSR].plot(x ='Year', y='Value', kind = 'line')
ax = df_useful["GDP"][select_russia].plot(x ='Year', y='Value', kind = 'line', ax = ax)
ax = df_useful["GDP"][select_ukraine].plot(x ='Year', y='Value', kind = 'line', ax = ax)
_ = ax.legend(["USSR", 'Russia', 'Ukraine'])
_ = ax.set_title('GDP in million US$ for different countries for the period 1970-2015')
In order to visualize folium maps, we need to associate a value with each country. The geojson file that we use is not timestamped and only contains countries that exist nowadays. As some countries have been dissolved during the past 50 years, our folium maps won't be complete. For instance, we do not have any value for Ukraine from 1970 to 1989. Our idea to fix this issue is presented in the next paragraph.
Our idea is to map the former country's value to each of the current ones. For instance, in 1982 the USSR GDP is around one trillion US$. Therefore, if we associate (only for folium map purposes) this value with each current country that succeeded the USSR, all these countries will appear in the same color on the folium map, i.e. the whole former USSR area will appear in the same (and correct) color.
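This remapping idea can be sketched on toy data (the GDP figures and the successor list below are illustrative, not the actual mapping used later):

```python
import pandas as pd

# Toy 1982 GDP rows: the USSR exists, its successor states do not (fictitious values).
gdp_1982 = pd.DataFrame({"Area": ["USSR", "France"],
                         "Value": [1_000_000.0, 600_000.0]})

# Map a dissolved country to (a subset of) its successor states; the real
# mapping is larger and based on historical research.
successors = {"USSR": ["Russia", "Ukraine", "Belarus"]}

# Replace the former country's name by the list of its successors, then
# explode so that every successor inherits the former country's value.
gdp_1982["Area"] = gdp_1982["Area"].apply(lambda a: successors.get(a, a))
gdp_1982 = gdp_1982.explode("Area").reset_index(drop=True)
```

Each successor row now carries the USSR value, so on a choropleth map all successor states would be colored identically.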
In order to do so, one needs to identify which countries appeared in and disappeared from the dataset, and in which year. We will use this result, along with some historical research, in our visualise_world_data_folium function (1.E.a.).
countries_formation_years = {}
for country in df_useful["GDP"]["Area"].unique():
    selection = df_useful["GDP"]["Area"] == country
    year_in, year_out = df_useful["GDP"][selection].dropna()["Year"].min(), df_useful["GDP"][selection].dropna()["Year"].max()
    for year in (year_in, year_out):
        if year not in countries_formation_years:
            countries_formation_years[year] = []
    countries_formation_years[year_in].append((country, '+'))
    countries_formation_years[year_out].append((country, '-'))
countries_formation_years.pop(1970)
countries_formation_years.pop(2015)
for year in sorted(list(countries_formation_years)):
    print(year, countries_formation_years[year])
In this part, we will finish preprocessing the datasets. More precisely, we will deal with country names and normalize the features.
Some countries have different names in the geojson file and in the dataset. We first start by correcting them.
def correct_country_names(old_name, dic):
    if old_name in dic.keys():
        return dic[old_name]
    return old_name
dic = {'Czechia': "Czech Republic",
'Russian Federation':'Russia',
"Serbia":"Republic of Serbia",
'The former Yugoslav Republic of Macedonia':'Macedonia',
'China, mainland':'China',
'Viet Nam':'Vietnam',
'Venezuela (Bolivarian Republic of)':'Venezuela',
'Iran (Islamic Republic of)':'Iran',
'Syrian Arab Republic':"Syria",
'Bolivia (Plurinational State of)': 'Bolivia',
"Côte d'Ivoire": "Ivory Coast",
'Congo':"Republic of the Congo",
"Lao People's Democratic Republic":'Laos',
"Democratic People's Republic of Korea":"North Korea",
'Republic of Korea':"South Korea"}
for df_name in df_useful:
    print(df_name)
    df_useful[df_name]["Area"] = df_useful[df_name]["Area"].apply(lambda x: correct_country_names(x, dic))
Then, we write a function that takes a dataframe and a year as input and produces the corresponding folium map. This function also handles dissolutions of countries as suggested before.
def visualise_world_data_folium(df, year, logScale=True):
    dic = {'USSR': ['Armenia', 'Azerbaijan', 'Belarus', 'Estonia', 'Georgia',
                    'Kazakhstan', 'Kyrgyzstan', 'Latvia', 'Lithuania',
                    'Montenegro', 'Republic of Moldova', 'Russia',
                    'Republic of Serbia', 'Timor-Leste', 'Turkmenistan', 'Ukraine',
                    'Uzbekistan'],
           'Ethiopia PDR': ['Eritrea', 'Ethiopia'],
           'Yugoslav SFR': ['Kosovo', 'Slovenia', 'Croatia',
                            'Macedonia', 'Bosnia and Herzegovina'],
           'Yemen Dem': ['Yemen'],
           'Czechoslovakia': ["Czech Republic", 'Slovakia'],
           'Netherlands Antilles (former)': ['Curaçao', 'Sint Maarten (Dutch Part)'],
           'Sudan (former)': ['South Sudan', 'Sudan']
           }
    to_plot = df[df["Year"] == year]
    to_plot = (to_plot[['Area', 'Value']]
               .dropna()
               .groupby('Area')
               .mean()
               .reset_index()
               .dropna())
    # Map dissolved countries to their successors, then duplicate their rows
    to_plot['Area'] = to_plot['Area'].apply(lambda x: correct_country_names(x, dic))
    to_plot = to_plot.explode('Area')
    if logScale:
        to_plot.Value = np.log10(1 + to_plot.Value)
    m = folium.Map(location=[40, -10], zoom_start=1.6)
    folium.Choropleth(
        geo_data="https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/world-countries.json",
        data=to_plot,
        columns=['Area', 'Value'],
        key_on='feature.properties.name',
        fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2, nan_fill_opacity=0.0
    ).add_to(m)
    folium.LayerControl().add_to(m)
    return m
We can now use it to produce some maps. For instance, we plot below the map of GDP for the years 1985 (before the dissolution of the USSR) and 1995 (after).
display(visualise_world_data_folium(df_useful["GDP"], 1985, True))
display(visualise_world_data_folium(df_useful["GDP"], 1995, True))
Some of our features seem to be right-skewed. At first glance, they look like power laws.
For instance, the distribution of GDP looks a bit like a power law:
_ = sns.distplot(df_useful["GDP"]["Value"], rug=False, hist=False)
As we later want to train some machine learning models, we take the log of those values so that their distribution looks more like a normal distribution.
#looks better with log scale
_ = sns.distplot(np.log(df_useful["GDP"]["Value"]), rug=False, hist=False)
The new distribution indeed looks more suitable for training models.
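This effect can also be checked numerically; here is a sketch on a synthetic right-skewed (log-normal) sample, assuming scipy is available, rather than on our actual GDP column:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)  # heavily right-skewed

raw_skew = skew(sample)          # strongly positive for a log-normal sample
log_skew = skew(np.log(sample))  # near 0: the log of a log-normal is normal
```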
In this part, we will make one uniformized dataframe uni_df with the following columns.
Country | Year | GDP | Crops production columns | Livestock production columns | Crops importation columns | Livestock importation columns | Crops exportation columns | Livestock exportation columns | CPI
In this uniformized dataframe, a tuple (Country, Year) uniquely identifies a row.
The current dataframes have several rows for a given (Country, Year) pair, one per item. We would like a unique row per (Country, Year) and one column per item:
need_pivot = ['Crops Area harvested',
'Crops Production',
'Crops Seed',
'Crops Yield',
'Livestock production',
'Live animals import quantities',
'Live animals export quantities',
'Food import quantities',
'Food export quantities']
def rename_columns(x, word):
    if x not in ['Area', 'Year', 'ha', 'tonnes', 'hg/ha', 'Head', '1000 Head']:
        return x + ' ' + word
    return x
df_useful['GDP'] = df_useful['GDP'].rename(columns = {'Value':'(GDP, million $)'})[["Area",'Year','(GDP, million $)']]
df_useful['Consumer price indices'] = df_useful['Consumer price indices'].rename(columns = {'Value':'(Consumer price indices, %)'})[["Area",'Year','(Consumer price indices, %)']]
for df_name in need_pivot:
    df_useful[df_name] = pd.pivot_table(df_useful[df_name], index=["Area", 'Year'], columns=["Item", "Unit"], values="Value").rename(columns=lambda x: rename_columns(x, df_name))
    display(df_name, df_useful[df_name].sample(5))
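The long-to-wide reshape that pivot_table performs can be illustrated on toy data (item names and values below are made up):

```python
import pandas as pd

# Toy long-format data: one row per (Area, Year, Item).
long_df = pd.DataFrame({
    "Area": ["France", "France", "France", "Austria"],
    "Year": [2000, 2000, 2001, 2000],
    "Item": ["Maize", "Wheat", "Maize", "Wheat"],
    "Unit": ["tonnes"] * 4,
    "Value": [16000.0, 37000.0, 16500.0, 1500.0],
})

# One row per (Area, Year), one column per (Item, Unit); combinations
# absent from the long data show up as NaN in the wide table.
wide_df = pd.pivot_table(long_df, index=["Area", "Year"],
                         columns=["Item", "Unit"], values="Value")
```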
Some NaN values have appeared. After some analysis, we concluded that we can replace those NaN values by zeros. Indeed, it seems that a NaN means the value was too low to be significant enough to measure.
# Deal with the NaN that appeared
for df_name in df_useful:
    if df_name != "GDP":
        for column in list(df_useful[df_name]):
            if column not in ['Area', 'Year']:
                df_useful[df_name][column].fillna(0, inplace=True)
# Remove the multi-index so that the merge with GDP and CPI is clean
for df_name in need_pivot:
    df_useful[df_name].columns = [' '.join([str(_) for _ in v]) for v in df_useful[df_name].columns.values]
    display(df_useful[df_name].sample(5))
We now create the uniformized dataframe uni_df. Each row corresponds to one tuple (Country, Year) so that we can later group by country or year. In addition to the country ("Area") and the "Year", the columns are either an economic feature ("GDP", "CPI") or an agricultural one (some crop harvested area, some livestock export quantity, ...). With this uniformized dataframe, we can later analyze correlations and links between different features across places and years. For example, we can measure the correlation of GDP with the production of a specific crop over all countries and all years.
uni_df = df_useful['GDP'].dropna()
for df_name in need_pivot :
    uni_df = pd.merge(uni_df, df_useful[df_name], how='left', on=['Area', 'Year'])
uni_df = pd.merge(uni_df,df_useful['Consumer price indices'], how='left', on=['Area', 'Year'])
# Deal with the NaN that appeared
for column in list(uni_df):
    if column not in ['Area', 'Year']:
        uni_df[column].fillna(0, inplace=True)
uni_df.sample(30)
In this part, we will explore the dataset in more detail. We will first look at food production, imports and exports, next examine the consumer price indices, then study the structure of international trade and the historical context, and finally look at the economic classification of countries.
For the next milestone, we will also produce some maps, showing for instance the production of a specific crop per country over the years. We have shown in previous parts that the dataset contains the necessary data, that we can handle the data at its size, and that we can plot maps.
In this section we will present and compute the notion of food self-sufficiency. We will use the quantitative definition of the Food and Agriculture Organization (FAO).
One may wonder whether a country produces all the food it needs. The notion of food self-sufficiency answers this question. More formally, it is a rate that describes the degree to which a country can meet its internal consumption needs through its own production, i.e. the extent to which a country is able to feed its population from its domestic food production. We are interested in this measure since we think it could be correlated with the economic condition of the country, particularly price stability. Price stability is defined in the next part.
In order to compute food self-sufficiency, we will apply the following formula, which gives the food self-sufficiency as a percentage:
$$\frac{Production \times 100}{Production + Imports - Exports}$$The following is a trial calculation of self-sufficiency. Refining which agricultural products should go into this calculation remains to be done for the next milestone. Indeed, with our first calculations it seems that the self-sufficiency is always lower than 100%, whereas this should not be the case.
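Before running it on the dataset, the formula can be sanity-checked on made-up numbers (the quantities below are purely illustrative):

```python
def self_sufficiency_ratio(production, imports, exports):
    # FAO food self-sufficiency ratio, in percent:
    # 100 * production / (production + imports - exports)
    return 100 * production / (production + imports - exports)

# A country producing 80 t, importing 30 t and exporting 10 t covers
# 80% of its apparent domestic supply (80 + 30 - 10 = 100 t).
print(self_sufficiency_ratio(80, 30, 10))   # 80.0

# A strong net exporter can exceed 100%:
print(self_sufficiency_ratio(120, 5, 45))   # 150.0
```

Note that the ratio can legitimately exceed 100% for net exporters, which is why a result that is always below 100% is suspicious.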
#We calculate food self-sufficiency for the most commonly produced and traded crop (by mass), which is cereals.
production_columns = list(uni_df.filter(like="Cereals (Rice Milled Eqv) Crops Production tonnes"))
import_columns = list(uni_df.filter(like="Cereals Food import quantities tonnes"))
export_columns = list(uni_df.filter(like="Cereals Food export quantities tonnes"))
uni_df[('All productions','tonnes')] = 0
for column in production_columns :
    uni_df[('All productions','tonnes')] += uni_df[column]
uni_df[('All imports','tonnes')] = 0
for column in import_columns :
    uni_df[('All imports','tonnes')] += uni_df[column]
uni_df[('All exports','tonnes')] = 0
for column in export_columns :
    uni_df[('All exports','tonnes')] += uni_df[column]
# Following the FAO formula, exports are subtracted from the apparent domestic supply.
uni_df[('food self-sufficiency','%')] = 100 * uni_df[('All productions','tonnes')] / (uni_df[('All productions','tonnes')] + uni_df[('All imports','tonnes')] - uni_df[('All exports','tonnes')])
display(uni_df[['Area','Year',('food self-sufficiency','%')]].sample(5))
plot = uni_df[['Area','Year']].copy()
plot["Value"] = uni_df[('food self-sufficiency','%')]
for year in range(1980, 2010, 5):
    display(year, visualise_world_data_folium(plot, year, False))
Consumer price indices (CPI) measure changes in the average price level of goods. Typically, a "basket of consumer goods and services" is used to calculate average consumer prices each year; the relative change of these prices is then used as a measure of inflation or deflation over a period of time. More technically, for a given item, the CPI is the ratio of the cost of the market basket between two different years. The global CPI is an average of single-item CPIs with standardized weights. The FAO dataset includes consumer food price indices, which means we have information about countries' food price stability over the years.
The CPI has many uses and is often taken into consideration. For instance, it is used for budget and pension revisions, monetary and economic policies, and economic analysis. It is a good indicator of relative price stability, which is essential for development and economic safety. The European Central Bank's main objective is price stability in the euro zone, defined as keeping the yearly growth of the consumer price index below 2%.
We will use the CPI to answer the following questions: "Are prices more stable in more self-sufficient countries?" and "Is there a link between the CPI and other agricultural features?"
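As a toy illustration of the mechanics (with made-up prices and basket weights, not FAO data), the index for a year is the cost of a fixed basket in that year relative to its cost in the base year:

```python
# Fixed basket of two goods; prices and quantities are hypothetical.
quantities = {'bread': 10, 'rice': 20}
base_prices = {'bread': 2.0, 'rice': 1.0}   # base year
new_prices = {'bread': 2.2, 'rice': 1.1}    # year of interest

base_cost = sum(base_prices[g] * quantities[g] for g in quantities)
new_cost = sum(new_prices[g] * quantities[g] for g in quantities)
cpi = round(100 * new_cost / base_cost, 2)
print(cpi)   # 110.0, i.e. a 10% average price increase since the base year
```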
Our dataset covers the historical period from 1970 to 2015. In order to correctly interpret the results we are going to see, we first did some brief historical research on this period. Listed below are important events of this period that we think have had a significant influence on agriculture and the economy.
The Cold War lasted from 1945 to 1990, with two economic superpowers (USA and USSR); the USSR was dissolved in 1991. The Japanese economic miracle, from 1945 to 1990, allowed Japan to recover from the disastrous state it was in at the end of WW2 and become one of the world's largest economies. There were two big oil crises, in 1973 and 1979. There were also many wars (Middle East wars 1973-2000, e.g. the Yom Kippur War in 1973, the Islamic Revolution in Iran in 1979, the Iran–Iraq war 1980-1988, the Gulf War 1990-1991, the Yugoslav wars 1991-2001...). We have already seen some consequences of such events when dealing with country names in a previous section.
The Third Agricultural Revolution (also known as the Green Revolution) occurred from 1960 to 1990 and improved agricultural production thanks to fertilizers and chemicals.
The following public-domain image from Wikimedia represents developed countries (blue), developing ones (orange) and least developed ones (red) according to the United Nations and International Monetary Fund. We expect to see similar results with our dataset (GDP).
![]()
The following image, also from Wikimedia, shows the cumulative commercial balance for the period 1980-2008. We also expect to see similar results with our dataset, but there might be differences as we focus on agriculture.
![]()
In order to have an idea of the international trade and economy structure, we are interested in GDP:
pivoted_GDP_df = uni_df[['Area','Year']].copy()
pivoted_GDP_df["GDP"] = uni_df["(GDP, million $)"]
pivoted_GDP_df = pivoted_GDP_df.pivot_table(index='Year', columns='Area', values="GDP").dropna(axis=1)
pivoted_GDP_df.sample(5)
As we can see on a subset of the correlation matrix below, the GDPs of different countries are often highly correlated.
selected_countries = ['Algeria', 'Australia', 'Austria', 'Bangladesh', 'China',
'Djibouti', 'France', 'Germany', 'India', 'Japan', 'Mali',
'Switzerland', 'United States of America']
corr = pivoted_GDP_df[selected_countries].corr()
corr.style.background_gradient(cmap='coolwarm')
The correlation matrix contains many values that are very close to one (red). This is also true for the whole correlation matrix, as seen below:
f = plt.figure(figsize=(19, 15))
plt.matshow(pivoted_GDP_df.corr(), fignum=f.number)
cb = plt.colorbar()
cb.ax.tick_params()
plt.title('Correlation Matrix', fontsize=16);
We then try to cluster this correlation matrix in order to find countries whose GDPs are correlated:
corr = pivoted_GDP_df.corr().values
pdist = spc.distance.pdist(corr) # condensed vector of pairwise distances between correlation-matrix rows
linkage = spc.linkage(pdist, method='complete')
ind = spc.fcluster(linkage, 0.32*pdist.max(), 'distance')
columns = [pivoted_GDP_df.columns.tolist()[i] for i in list((np.argsort(ind)))]
clusterised_df = pivoted_GDP_df.reindex(columns, axis=1)
f = plt.figure(figsize=(19, 15))
plt.matshow(clusterised_df.corr(), fignum=f.number)
cb = plt.colorbar()
cb.ax.tick_params()
plt.title('Correlation Matrix', fontsize=16);
We have found regions in which the GDP is highly correlated and between which the correlation coefficient is lower. We could refine the big clusters by iterating this method.
Interpretation: the correlation matrix of GDP contains many values that are very close to one. This means that the GDPs of two different countries tend to evolve in the same way. Therefore, we can say that the world's countries have strong enough trading relations for their GDPs to evolve together. The main clusters we found could be interpreted as regions within which the trading relations are stronger.
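The clustering steps above can be wrapped in a reusable helper, which could then be re-applied to the sub-matrix of one big cluster to refine it. A minimal sketch on synthetic data (the 0.32 threshold is the same heuristic used above):

```python
import numpy as np
import scipy.cluster.hierarchy as spc
from scipy.spatial.distance import pdist

def cluster_corr(corr, threshold=0.32):
    # Same procedure as above: complete-linkage clustering of the rows
    # of a correlation matrix, cut at a fraction of the largest distance.
    d = pdist(corr)
    linkage = spc.linkage(d, method='complete')
    return spc.fcluster(linkage, threshold * d.max(), 'distance')

# Synthetic stand-in for GDP series: two groups of mutually correlated series.
rng = np.random.RandomState(0)
base_a, base_b = rng.randn(100), rng.randn(100)
series = np.column_stack([base_a + 0.1 * rng.randn(100) for _ in range(3)] +
                         [base_b + 0.1 * rng.randn(100) for _ in range(3)])
labels = cluster_corr(np.corrcoef(series, rowvar=False))
print(labels)   # the first three series share one label, the last three another
```

To refine a cluster, one would select the columns carrying a given label and call `cluster_corr` again on their sub-correlation-matrix, possibly with a smaller threshold.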
Below, we plot the distribution of GDP over the world for the last decades:
plot = uni_df[['Area','Year']].copy()
plot["Value"] = uni_df["(GDP, million $)"]
for year in range(1980, 2015, 5):
    display(year, visualise_world_data_folium(plot, year, True))
The countries with high GDP indeed correspond to the most developed countries. The trends we can observe in these plots look very significant (USSR dissolution, China's economic growth, ...).
For the next milestone, this part will be moved to the end of the file and conducted for specific agricultural features (e.g. trade of the crop most highly correlated with GDP) once the most important ones have been identified.
We also plan, for the next milestone, to analyse in more detail the correlation between food self-sufficiency and economic success.
uni_df[('food self-sufficiency','%')].corr(method='pearson', other = uni_df['(Consumer price indices, %)'])
uni_df[('food self-sufficiency','%')].corr(method='pearson', other = uni_df["(GDP, million $)"])
Out of all the crops and livestock produced, we will focus on the top 20 crops and the top 20 livestock products produced globally, so as to keep the model at a reasonable number of features.
#Choose the top 20 crops produced in the world for the first model
crop_production_df= uni_df.filter(like='Crops Production tonnes')
crop_production_df = crop_production_df.sum(axis=0).sort_values(ascending=False)
crop_production_list = crop_production_df.index.tolist()[:20]
crop_production_list.insert(0,'(GDP, million $)')
crop_production_list
#Take the top 20 crops and the columns of our uni_df which contain their production
maincrops_df = uni_df[crop_production_list]
#Livestock
livestock_production_df= uni_df.filter(like='Livestock production')
livestock_production_df = livestock_production_df.sum(axis=0).sort_values(ascending=False)
livestock_production_list = livestock_production_df.index.tolist()[:20]
livestock_production_list.insert(0,'(GDP, million $)')
livestock_production_list
#Take the top 20 livestock products and the columns of our uni_df which contain their production
mainlivestock_df = uni_df[livestock_production_list]
top_production_list = crop_production_list +livestock_production_list[1:]
top_production_df = uni_df[top_production_list]
top_production_df.head(5)
#Checking correlations of main crops between each other and with GDP
top_production_correlation_matrix = round (top_production_df.corr(method='pearson'),3)
top_production_correlation_matrix['(GDP, million $)'].sort_values(ascending = False)
Because "Cereals (Rice Milled Eqv) Crops Production tonnes" has such a high correlation with many other features (e.g. over 90% with wheat), it is probably an aggregate of them.
We would now like to look at some relationships between these measures and the GDP:
#Looking at some relationships
for item in list(top_production_df.columns)[1:]:
    top_production_df.plot(kind='scatter', x=item, y='(GDP, million $)', grid=True)
We can see quite clearly that the production of many of the most popular crops is related to GDP.
Let's see if the same can be said about the most exported/imported goods.
#Choose the top 20 most exported items by mass
top_exports_df= uni_df.filter(like='export quantities tonnes')
top_exports_df = top_exports_df.sum(axis=0).sort_values(ascending=False)
top_exports_list =top_exports_df.index.tolist()[:20]
top_exports_list
#Take the top 20 exported items and the columns of our uni_df which contain their export quantities
top_exports_list.insert(0,'(GDP, million $)')
top_exports_df = uni_df[top_exports_list]
#Choose the top 20 most imported items by mass
top_imports_df= uni_df.filter(like='import quantities tonnes')
top_imports_df = top_imports_df.sum(axis=0).sort_values(ascending=False)
top_imports_list =top_imports_df.index.tolist()[:20]
top_imports_list
#Take the top 20 imported items and the columns of our uni_df which contain their import quantities
top_imports_list.insert(0,'(GDP, million $)')
top_imports_df = uni_df[top_imports_list]
#Create a joined list
top_traded_list = top_exports_list +top_imports_list[1:]
top_traded_df = uni_df[top_traded_list]
top_traded_df.sample(5)
We now check the correlations of main traded goods with GDP:
top_traded_correlation_matrix = round (top_traded_df.corr(method='pearson'),3)
top_traded_correlation_matrix['(GDP, million $)'].sort_values(ascending = False)
It seems that the highest correlations can be found from exports of meat as well as feeding stuff.
We would now like to look at some relationships between these measures and the GDP:
for item in list(top_traded_df.columns)[1:]:
    top_traded_df.plot(kind='scatter', x=item, y='(GDP, million $)', grid=True)
First, we create a model using only the production data. Afterwards, we add the data on exports and imports.
_ = sns.distplot(top_production_df["Maize Crops Production tonnes"], rug=False, hist=False)
_ = sns.distplot(np.log10(top_production_df["Maize Crops Production tonnes"]), rug=False, hist=False)
We notice that the production of goods, just like the GDP, has a more "normal-looking" distribution on a log scale. We will thus create a new uniformized dataframe, applying the logarithm to all values.
#Using 1+x so as to keep 0 values at 0.
def log(x):
    return np.log10(1+x)
#create a new dataframe with log values, so that we have normal distributions for later analysis
uni_df_log = uni_df.copy()
uni_df_log.iloc[:,2:] = uni_df_log.iloc[:,2:].apply(lambda x : log(x))
top_production_log_df = uni_df_log[top_production_list]
top_production_log_df.sample(5)
#We then normalize the data, so as to have comparable ranges. We use the dataframe of log values.
top_production_values = top_production_log_df.values
standard_scaler = preprocessing.StandardScaler()
top_production_stand_values =standard_scaler.fit_transform(top_production_values)
top_production_stand = pd.DataFrame(top_production_stand_values, columns=top_production_log_df.columns)
top_production_stand.sample(5)
train_set, test_set = train_test_split(top_production_stand.values, test_size = 0.2, random_state = 1)
X_train_set = train_set[:,1:]
Y_train_set = train_set[:,0]
X_test_set = test_set[:,1:]
Y_test_set = test_set[:,0]
number_of_folds = 5
scores = []
list_of_alpha = np.arange(0, 10, 0.01)
for alpha in tqdm(list_of_alpha):
    clf = Ridge(alpha = alpha)
    score = cross_val_score(clf, X_train_set, Y_train_set, cv=number_of_folds, scoring = 'neg_mean_squared_error')
    scores.append([alpha, score.mean()])
a=np.array(scores)
best_alpha = a[np.where(a==np.amax(a[:,1]))[0]][0,0]
print("The best value obtained is for alpha equal to " + str(best_alpha) + " with a MSE of "+ str(-a[np.where(a==np.amax(a[:,1]))[0]][0,1]))
alphas = [elt[0] for elt in scores]
MSE = [-elt[1] for elt in scores]
sns.lineplot(alphas, MSE)
_ = plt.title("Cross validation score")
_ = plt.ylabel("Mean Squared Error")
_ = plt.xlabel("alphas")
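As an alternative to the manual grid search above, scikit-learn's RidgeCV performs the same alpha selection in one call. A sketch on synthetic data standing in for our standardized features:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression problem standing in for the standardized features.
rng = np.random.RandomState(1)
X = rng.randn(200, 10)
y = X @ rng.randn(10) + 0.1 * rng.randn(200)

# RidgeCV cross-validates every alpha on the grid and keeps the best one.
model = RidgeCV(alphas=np.arange(0.01, 10, 0.01),
                scoring='neg_mean_squared_error', cv=5)
model.fit(X, y)
print(model.alpha_)   # best alpha found by 5-fold cross-validation
```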
# We build our model with the chosen alpha.
model_top_production = Ridge(alpha=best_alpha)
model_top_production.fit(X_train_set, Y_train_set)
weights_top_production = pd.DataFrame([model_top_production.coef_], columns=top_production_df.columns[1:])
weights_top_production = weights_top_production.sort_values(by=0, axis=1, ascending=False)
weights_top_production
model_top_production.score(X_train_set, Y_train_set)
Results are so far middling, and the feature set also needs cleaning, since some columns are aggregate values (e.g. those including "Total" in their name) that need to be removed. We will continue working on this for the next milestone.
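The cleaning mentioned above could look like the following sketch: drop columns flagged as aggregates by their name, then drop one column of any near-duplicate pair. The name markers and the 0.9 correlation threshold are assumptions for illustration, not values fixed by our analysis:

```python
import numpy as np
import pandas as pd

def drop_aggregates(df, name_markers=('Total',), corr_threshold=0.9):
    # 1) Drop columns whose name flags them as an aggregate.
    keep = [c for c in df.columns
            if not any(marker in c for marker in name_markers)]
    df = df[keep]
    # 2) Drop the later column of any pair correlated above the threshold.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return df.drop(columns=redundant)

# Toy frame: 'Cereals Total' is the sum of wheat and maize, so it is dropped.
toy = pd.DataFrame({'Wheat Production': [1, 2, 3, 4],
                    'Maize Production': [4, 1, 3, 2],
                    'Cereals Total': [5, 3, 6, 6]})
print(list(drop_aggregates(toy).columns))   # ['Wheat Production', 'Maize Production']
```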
#Use dataframe with log values
top_traded_log_df = uni_df_log[top_traded_list]
#We then normalize the data, so as to have comparable ranges. We use the dataframe of log values.
top_traded_values = top_traded_log_df.values
standard_scaler = preprocessing.StandardScaler()
top_traded_stand_values =standard_scaler.fit_transform(top_traded_values)
top_traded_stand = pd.DataFrame(top_traded_stand_values, columns=top_traded_log_df.columns)
top_traded_stand.sample(5)
train_set, test_set = train_test_split(top_traded_stand.values, test_size = 0.2, random_state = 1)
X_train_set = train_set[:,1:]
Y_train_set = train_set[:,0]
X_test_set = test_set[:,1:]
Y_test_set = test_set[:,0]
number_of_folds = 5
scores = []
list_of_alpha = np.arange(0, 10, 0.01)
for alpha in tqdm(list_of_alpha):
    clf = Ridge(alpha = alpha)
    score = cross_val_score(clf, X_train_set, Y_train_set, cv=number_of_folds, scoring = 'neg_mean_squared_error')
    scores.append([alpha, score.mean()])
a=np.array(scores)
best_alpha = a[np.where(a==np.amax(a[:,1]))[0]][0,0]
print("The best value obtained is for alpha equal to " + str(best_alpha) + " with a MSE of "+ str(-a[np.where(a==np.amax(a[:,1]))[0]][0,1]))
alphas = [elt[0] for elt in scores]
MSE = [-elt[1] for elt in scores]
sns.lineplot(alphas, MSE)
_ = plt.title("Cross validation score")
_ = plt.ylabel("Mean Squared Error")
_ = plt.xlabel("alphas")
# We build our model with the chosen alpha.
model_trade = Ridge(alpha=best_alpha)
model_trade.fit(X_train_set, Y_train_set)
weights_trade = pd.DataFrame([model_trade.coef_], columns=top_traded_stand.columns[1:])
weights_trade = weights_trade.sort_values(by=0, axis=1, ascending=False)
weights_trade
model_trade.score(X_train_set, Y_train_set)
This model also needs to be cleaned, because the feature with the lowest weight is actually a component of the feature with the highest weight.
But it seems we can already draw some conclusions: cakes, soybeans, oilseed cake meal and feeding stuff are all used as animal feed. The fact that they are among the features with the highest weights clearly indicates a link between GDP and high imports of animal feed.
In this part, we plan, for the next milestone, to carry out a similar analysis with the CPI as was done with the GDP in the previous section.
With clean dataframes, we can now really focus on producing some interesting results. Our initial idea was to observe the effects of different crops and food items on the economic growth of countries, as well as differences in self-sufficiency. While exploring the data, we turned our attention toward prediction models. Training a Ridge model on our data will allow us to identify the agricultural products most correlated with the economic growth of countries (predicting GDP from agricultural features). The second model (predicting the CPI variation rate) will allow us to identify products linked with economic stability.
The identification of such items would give interesting insight toward understanding geopolitical strategies and challenges. Further insight might be gained by identifying who the producers of these "economically strong" crops and animal products are and visualising the geographical distribution of the most important resources.
Our objectives for the following weeks are: